Skip to content

fix: avoid blocking unrelated sessions during pool initialization (#256)#257

Merged
thepagent merged 4 commits intoopenabdev:mainfrom
clsung:bugfix/session_pool
Apr 17, 2026
Merged

fix: avoid blocking unrelated sessions during pool initialization (#256)#257
thepagent merged 4 commits intoopenabdev:mainfrom
clsung:bugfix/session_pool

Conversation

@clsung
Copy link
Copy Markdown
Contributor

@clsung clsung commented Apr 12, 2026

Refactor SessionPool to move expensive AcpConnection initialization outside of the write lock. This ensures that a single slow or stuck session startup does not block access to other active sessions in the pool.

Changes:

  • Wrap AcpConnection in Arc<Mutex<...>> for granular per-session locking.
  • Switch SessionPool to use read locks for connection retrieval.
  • Perform connection spawning and initialization before acquiring the pool's write lock.
  • Use a fast-path read lock to check session health before attempting a rebuild.
  • fixes bug: SessionPool lock scope crosses await and can stall prompt handling #256

Discord Discussion URL: https://discord.com/channels/1491295327620169908/1491365150664560881/1493769615984164975

@clsung
Copy link
Copy Markdown
Contributor Author

clsung commented Apr 15, 2026

Rebased this branch onto current openabdev/openab main and ported the fix onto the post-#310 PoolState design.

What changed:

  • moved per-session access to Arc<tokio::sync::Mutex<AcpConnection>>
  • dropped the global pool/state lock before long async work in get_or_create()
  • changed with_connection() to clone the per-connection handle under a read lock, then lock only that session
  • kept #310 suspend/resume behavior instead of reverting it
  • kept cleanup non-blocking by probing per-connection locks outside the global state lock
  • fixed two ABA-style races in eviction/cleanup by removing entries only when the sampled Arc handle still matches the current map entry

Validation:

  • cargo test passes on the rebased branch
  • added focused unit tests for the identity-checked removal helper
  • current result: 42 passed; 0 failed

This PR now targets the lock-scope bug from #256 without discarding the merged #310 pool changes.

@thepagent thepagent added the p1 High — address this sprint label Apr 15, 2026
@github-actions github-actions bot removed the closing-soon PR missing Discord Discussion URL — will auto-close in 3 days label Apr 15, 2026
Copy link
Copy Markdown
Contributor

@masami-agent masami-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #257

Thanks for the thorough refactor, @clsung. This addresses a real p1 issue — the current pool holds a write lock across AcpConnection::spawn() + initialize() + session_new(), which are all async and can take seconds. During that time, every other thread is blocked.

Architecture change

Before After
Connection storage HashMap<String, AcpConnection> HashMap<String, Arc<Mutex<AcpConnection>>>
Pool lock during spawn Write lock held across spawn + init + session_new Read lock for snapshot, no lock during spawn, write lock only for final insert
with_connection Write lock on pool state Read lock on pool state + per-connection mutex
Eviction Under write lock, direct access Snapshot + try_lock + remove_if_same_handle
Cleanup Under write lock Snapshot + try_lock + deferred removal

This is a significant improvement. The key insight — moving expensive async work outside the pool lock — is correct and well-executed.

✅ What looks good

  1. remove_if_same_handle — Smart use of Arc::ptr_eq to prevent ABA problems. If another task replaced the connection between snapshot and removal, the stale handle is not removed. Well-tested with two unit tests.

  2. with_connection now uses read lock — This is the biggest win. Previously, a streaming prompt held a write lock on the entire pool. Now it only holds a per-connection mutex, so other threads are unblocked.

  3. cleanup_idle uses try_lock — Busy connections are skipped instead of blocking the cleanup sweep. Correct behavior: a busy session is by definition not idle.

  4. Race condition handling in get_or_create — After spawning outside the lock, re-checks if another task already created a healthy connection for the same thread. Good.

  5. HRTB fix on with_connectionfor<'a> FnOnce(&'a mut AcpConnection) is the correct lifetime bound for the new pattern where the mutex guard (not the pool state) owns the borrow.

🔴 Must fix before merge

1. Eviction can fail silently, leaving pool permanently full

if state.active.len() >= self.max_sessions {
    if let Some((key, expected_conn, _, sid)) = eviction_candidate {
        if remove_if_same_handle(&mut state.active, &key, &expected_conn).is_some() {
            // ...
        }
    }
}

if state.active.len() >= self.max_sessions {
    return Err(anyhow!("pool exhausted ({} sessions)"));
}

If eviction_candidate is None (all other connections were locked via try_lock) or remove_if_same_handle returns None (connection was replaced), the pool stays full and returns an error. This is correct as a safety net, but the original code would always find an oldest entry because it had direct access under the write lock.

In the new design, under high concurrency, try_lock could fail on all connections, making eviction impossible even though idle sessions exist. Consider falling back to lock().await on the oldest candidate if try_lock fails on all entries, or at least logging a warning when eviction fails due to lock contention.

2. saved_session_id from suspended map is read under read lock but removed under write lock later — potential double-resume

let (existing, saved_session_id) = {
    let state = self.state.read().await;
    (
        state.active.get(thread_id).cloned(),
        state.suspended.get(thread_id).cloned(),  // read, not removed
    )
};

The suspended session ID is read (cloned) but not removed from the map. Later, state.suspended.remove(thread_id) happens under the write lock. But between the read and the write, another task for the same thread could also read the same suspended ID and attempt session_load with it. Two connections could try to resume the same session.

This is unlikely (same thread_id racing) but possible if a user sends two messages rapidly in the same thread. Consider either:

  • Accepting this as a known edge case (session_load is idempotent-ish)
  • Or adding a comment documenting the race window

🟡 Non-blocking

3. had_existing flag for session_reset

let had_existing = existing.is_some();
// ...
if !resumed {
    new_conn.session_new(&self.config.working_dir).await?;
    if had_existing || saved_session_id.is_some() {
        new_conn.session_reset = true;
    }
}

The original code only set session_reset when saved_session_id.is_some(). This PR adds had_existing as an additional condition. This is a behavioral change — if a stale connection existed but had no suspended session, session_reset is now set. Is this intentional? If so, a comment explaining why would help.

4. suspend_entry function removed

The suspend_entry helper is removed and its logic is inlined in multiple places. This is fine for now, but if more callers need suspension logic, consider extracting a helper again.

5. Connection drop timing

In the old code, suspend_entry explicitly drops the connection (triggering process group kill) inside the write lock. In the new code, the old Arc<Mutex<AcpConnection>> is dropped when remove_if_same_handle returns it and it goes out of scope. This should be fine since the Arc refcount will reach zero, but worth verifying that no other clone holds a reference that would delay the drop.

📝 Summary

Priority Item Status
🔴 Eviction can fail silently under lock contention Needs fix or documented fallback
🔴 Potential double-resume race on suspended session Needs comment or fix
🟡 had_existing behavioral change Needs clarification
🟡 suspend_entry inlined Non-blocking
🟡 Connection drop timing Non-blocking

The core design is sound and addresses the p1 issue well. The two 🔴 items are edge cases but worth addressing before merge given this is the session pool — the most critical concurrency component in the system.

Looking forward to the next revision!

@clsung
Copy link
Copy Markdown
Contributor Author

clsung commented Apr 15, 2026

Addressed the review follow-up on the latest revision:

  • #1 full-pool eviction under contention: kept the safe pool exhausted fallback, and now log explicit warnings when eviction fails because the sampled candidate changed or because all other sessions were busy during the eviction scan
  • #2 double-resume window: added a per-thread creation gate (creating) so same-thread get_or_create() / resume work is serialized and two rapid requests cannot race into duplicate session/load attempts
  • #3 session_reset behavior: added an inline comment documenting why stale live entries without a resumable session id still set the reset banner

Validation:

  • added a unit test for the per-thread gate helper
  • cargo test passes: 43 passed; 0 failed

Latest commit: 3fc2377

Copy link
Copy Markdown
Contributor

@masami-agent masami-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — all items addressed ✅

@clsung addressed all three review items cleanly. Verified the full diff (not just the items I raised).

🔴 #1 — Eviction under contention

Kept pool exhausted as the safe fallback (correct — better to reject than corrupt state). Added two explicit warning logs:

  • "pool full but eviction candidate changed before removal" — ABA case
  • "pool full but all other sessions were busy during eviction scan" — all try_lock failed

Operators can now diagnose eviction failures from logs. Acceptable approach.

🔴 #2 — Double-resume race

Added a per-thread creation gate (creating: HashMap<String, Arc<Mutex<()>>>). The _create_guard at the top of get_or_create serializes same-thread calls, completely eliminating the race window. The write lock for get_or_insert_gate is brief (just a HashMap insert) and released before any heavy async work. Clean solution.

Unit test get_or_insert_gate_reuses_gate_for_same_thread confirms gate reuse via Arc::ptr_eq.

🟡 #3session_reset behavioral change

Inline comment added explaining the intent — stale live entries without a resumable session ID still warrant the reset banner. Makes sense.

Full diff verification

  • ✅ Only src/acp/pool.rs changed — no Cargo.toml/lock version issues
  • ✅ Lock ordering is correct: write lock for gate insert → release → per-thread gate → read lock for snapshot → no lock during spawn/init → write lock for final insert
  • cleanup_idle and with_connection patterns are consistent with get_or_create
  • ✅ 3 unit tests covering remove_if_same_handle (match + mismatch) and get_or_insert_gate
  • cargo test: 43 passed, 0 failed

Recommend squash merge.

@masami-agent
Copy link
Copy Markdown
Contributor

@clsung Two items to fix before this can merge:

1. CI fail — clippy type_complexity

error: very complex type used
--> src/acp/pool.rs:103
|   let mut eviction_candidate: Option<(String, Arc<Mutex<AcpConnection>>, Instant, Option<String>)> = None;

Simplest fix: add a type alias, e.g.

type EvictionCandidate = (String, Arc<Mutex<AcpConnection>>, Instant, Option<String>);

then use Option<EvictionCandidate>.

2. Version regression — Cargo.toml
PR branch has version = "0.7.4" but main is already at 0.7.5. Please rebase or update to match main.

@clsung
Copy link
Copy Markdown
Contributor Author

clsung commented Apr 15, 2026

Addressed the two latest maintainer follow-ups:

  • fixed the clippy type_complexity complaint in src/acp/pool.rs by introducing an EvictionCandidate type alias
  • synced the crate version to current main (0.7.5) in Cargo.toml and the openab package entry in Cargo.lock

Validation:

  • cargo test passes: 43 passed; 0 failed
  • cargo clippy --all-targets -- -D warnings passes

Latest commit: deafa0e

@masami-agent
Copy link
Copy Markdown
Contributor

@clsung Two items before we can merge:

  1. Merge conflict — please rebase onto latest main
  2. Version regression — PR branch is at 0.7.5 but main is now 0.7.7. Please sync the version after rebase.

@obrutjack
Copy link
Copy Markdown
Collaborator

Code logic reviewed and looks solid — Arc<Mutex<AcpConnection>> per-connection locking, creating gate for same-thread serialization, and remove_if_same_handle with Arc::ptr_eq are all correct patterns.

Two blockers before I can approve:

  1. Rebase onto main — currently has merge conflicts
  2. Sync Cargo.toml version — PR is at 0.7.5, main is 0.7.7

Once rebased with CI green and no version regression, I'll approve and we can proceed to merge.

clsung added 4 commits April 16, 2026 17:02
…enabdev#256)

Refactor SessionPool to move expensive AcpConnection initialization
outside of the write lock. This ensures that a single slow or stuck
session startup does not block access to other active sessions in
the pool.

Changes:
- Wrap AcpConnection in Arc<Mutex<...>> for granular per-session locking.
- Switch SessionPool to use read locks for connection retrieval.
- Perform connection spawning and initialization before acquiring the
  pool's write lock.
- Use a fast-path read lock to check session health before attempting
  a rebuild.
@clsung clsung force-pushed the bugfix/session_pool branch from deafa0e to e10f3d2 Compare April 16, 2026 09:06
@clsung
Copy link
Copy Markdown
Contributor Author

clsung commented Apr 16, 2026

Rebased bugfix/session_pool onto latest main and force-pushed the branch.

Current branch state:

  • based on latest canonical main
  • crate version synced to 0.7.7
  • latest commit: e10f3d2

Validation after rebase:

  • cargo test passes: 43 passed; 0 failed
  • cargo clippy --all-targets -- -D warnings passes

This should clear the remaining merge-conflict and version-regression blockers.

Copy link
Copy Markdown
Collaborator

@obrutjack obrutjack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge checklist verified:

  • ✅ CI all green (cargo check + clippy + 7 Docker smoke tests)
  • ✅ No merge conflicts
  • ✅ No version regression (0.7.7 matches main)
  • ✅ Clippy fixed (type alias for EvictionCandidate)
  • ✅ masami-agent approved
  • ✅ Code logic reviewed: Arc<Mutex> per-connection locking, creating gate, remove_if_same_handle — all correct

Pending code owner review from @thepagent to merge.

@thepagent
Copy link
Copy Markdown
Collaborator

Staging after next stable release.

@chaodu-agent
Copy link
Copy Markdown
Collaborator

Community Triage Notes — Edge Cases & Minor Concerns

Overall this PR is solid and correctly addresses the core lock-scope issue from #256. Recording a few edge cases and minor concerns here for future reference in case they come up.

1. with_connection has no timeout on per-connection lock

After dropping the pool read lock, with_connection calls conn.lock().await with no timeout. If a previous task holds the connection mutex indefinitely (e.g. a stuck prompt stream — the exact scenario #256 describes), the next same-thread request will block forever at this point. The creating gate serializes the create/resume path, but with_connection contention is not yet guarded. Worth considering adding tokio::time::timeout or a try_lock + retry pattern here if same-thread hangs resurface.

2. Failed init does not clean up stale active entry

If get_or_create detects a stale connection but then fails during spawn / initialize / session_new, the stale Arc<Mutex<AcpConnection>> remains in state.active. Subsequent calls will re-detect it as stale and retry, but the dead entry occupies a pool slot in the meantime. Consider removing the stale entry before attempting init, or cleaning it up in the error path.

3. creating gate map grows unbounded

PoolState::creating inserts an Arc<Mutex<()>> per thread_id and never removes them. Since pool size is capped this is unlikely to be a practical issue, but for long-running deployments with many distinct thread IDs, it might be worth pruning gates for threads no longer in active during cleanup_idle.

4. session_reset banner on never-alive connections

The condition had_existing || saved_session_id.is_some() will set session_reset = true even when the existing entry was never alive and has no resumable session ID. The false-positive reset banner is harmless but slightly imprecise semantically.


None of these are blockers — just documenting them so we have a reference if any of these edge cases surface in production.

@thepagent thepagent merged commit dbfa231 into openabdev:main Apr 17, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

p1 High — address this sprint

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: SessionPool lock scope crosses await and can stall prompt handling

6 participants